Brandon Greenawalt
Data Science Technologist
Center for Social Research
R is fast becoming a general programming environment
R is an object-oriented programming language.
Cost
Power
Open
Maintainers/Package Creators
RStudio, Inc. is a company dedicated to all things R
RStudio is also the single largest contributor to R
It was founded by J.J. Allaire
Hadley Wickham is the Chief Scientist
RStudio is also an integrated development environment (IDE) for R
Before RStudio, there were other options besides using the console.
- Notepad++
- Tinn-R
- R-Commander
In scripting with RStudio, you are getting:
- Code completion (use tab to autocomplete anything)
- Code highlighting
- Code diagnostics/warnings
- Code snippets (tab for apply and loops)
- Easily accessible help files (F1 on any function)
- Code tidying (Ctrl + Shift + A)
- More shortcuts than you can learn (Alt + Shift + K)
- Automatic pairing of brackets and quotes (or...ruining your typing)
In addition to R scripts, RStudio offers optimized editing for:
RStudio is also an excellent tool for reproducible research.
All tabs opened will remain open when you revisit the project.
You can have multiple projects running at the same time
Help you get more organized.
Help you get more reproducible.
Everything in R is an object.
You must create an object and you can then call on the object.
numList = 1:5
numList
## [1] 1 2 3 4 5
numList * 5
## [1] 5 10 15 20 25
R has many different kinds of objects:
Item
- Numeric
- Character
- Factor/ordered
Data
- Data frame
- Matrix
- List
- Tibble
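As a minimal sketch of the object kinds listed above, the snippet below builds one of each in base R and checks its class. The variable names are made up for illustration, and tibbles are left out because they need the tibble package.

```r
# Illustrative objects; the names are hypothetical and not from the slides.
num <- c(1.5, 2, 3)                                       # numeric vector
chr <- c("a", "b", "c")                                   # character vector
fac <- factor(c("low", "high", "low"))                    # factor
ord <- factor(c("low", "high", "low"),
              levels = c("low", "high"), ordered = TRUE)  # ordered factor
df  <- data.frame(num = num, chr = chr)                   # data frame
mat <- matrix(1:6, nrow = 2)                              # matrix
lst <- list(num, chr, df)                                 # list

# class() can return more than one value, so keep only the first
kinds <- sapply(list(num = num, chr = chr, fac = fac, ord = ord,
                     df = df, mat = mat, lst = lst),
                function(x) class(x)[1])
print(kinds)
```

A list can hold objects of different kinds, which is why a data frame fits inside `lst` next to two vectors.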
Because R creates objects, each object can be referenced through an index.
Like many other languages, an object’s index is generally accessed using []:
numList[1:3]
## [1] 1 2 3
numList[1:3] * 5
## [1] 5 10 15
For named objects, we can use the $:
head(mtcars$mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
Just like matrix algebra and dimensional lumber – obj[rows, columns]
mtcars[1, ]
##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
head(mtcars[, 1])
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
mtcars[1, 1]
## [1] 21
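The same obj[rows, columns] pattern also accepts names and logical conditions, which tends to read better than bare numbers. A small sketch using only base R and the built-in mtcars data (the object names on the left are illustrative):

```r
# Index by row name and column name instead of position
rx4Mpg <- mtcars["Mazda RX4", "mpg"]

# First three rows, two columns selected by name
firstThree <- mtcars[1:3, c("mpg", "cyl")]

# Logical condition in the row slot: keep cars above 30 mpg
efficient <- mtcars[mtcars$mpg > 30, c("mpg", "wt")]

# A negative index drops elements instead of keeping them
numList <- 1:5
withoutFirst <- numList[-1]

print(rx4Mpg)
print(dim(firstThree))
print(rownames(efficient))
print(withoutFirst)
```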
Like any other language (or program, for that matter), R has the ability to use operators:
mtcars$mpg[mtcars$cyl == 6 | mtcars$cyl == 8 & mtcars$hp >= 146]
## [1] 21.0 21.0 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
## [15] 15.5 15.2 13.3 19.2 15.8 19.7 15.0
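One detail worth spelling out in the expression above: in R, & binds more tightly than |, so the hp condition applies only to the 8-cylinder cars. The sketch below compares the implicit grouping with two explicit ones (the variable names are illustrative):

```r
# As written on the slide: & is evaluated before |
implicit <- mtcars$cyl == 6 | mtcars$cyl == 8 & mtcars$hp >= 146

# The same thing with the grouping made explicit
explicit <- mtcars$cyl == 6 | (mtcars$cyl == 8 & mtcars$hp >= 146)

# A different question: 6 or 8 cylinders, and at least 146 hp
grouped <- (mtcars$cyl == 6 | mtcars$cyl == 8) & mtcars$hp >= 146

print(identical(implicit, explicit))
print(c(implicit = sum(implicit), grouped = sum(grouped)))
```

Adding the parentheses costs nothing and saves the reader from having to remember the precedence rules.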
And math functions:
sqrt((2 + 2)^2 * (7 / (2 - 1))) * pi
## [1] 33.24749
Even with all of the packages that R has, base R is still extremely powerful by itself.
str(numList)
## int [1:5] 1 2 3 4 5
summary(numList)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1       2       3       3       4       5
mean(numList)
## [1] 3
cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
lm(mpg ~ wt, data = mtcars)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept)           wt
##      37.285       -5.344
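Because the fitted model is itself an object, it can be stored and queried rather than only printed. A short sketch with base R (the name fit is illustrative):

```r
# Store the model instead of just printing it
fit <- lm(mpg ~ wt, data = mtcars)

# Pull pieces out of the fitted object
slope     <- coef(fit)[["wt"]]
rSquared  <- summary(fit)$r.squared

# Predict mpg for a hypothetical car weighing 3,000 lbs (wt is in 1000 lbs)
predicted <- predict(fit, newdata = data.frame(wt = 3))

print(round(slope, 3))
print(rSquared)
print(predicted)
```

For a single-predictor model like this one, the R-squared equals the squared correlation between mpg and wt computed earlier.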
plot(mtcars$wt, mtcars$mpg, pch = 19)
R allows you to combine functions:
plot(mtcars$wt, mtcars$mpg, pch = 19)
lines(lowess(mtcars$wt, mtcars$mpg), col = "#FF6600", lwd = 2)
abline(lm(mpg ~ wt, data = mtcars), col = "#0099ff", lwd = 2)
The Comprehensive R Archive Network (CRAN) is the “official” package repository for R.
CRAN Task Views allow you to see a variety of functions associated with topics.
| Task View Examples | Example Packages |
|---|---|
| Econometrics | wbstats & plm |
| Finance | quantmod & urca |
| Machine Learning | rpart & caret |
| Natural Language Processing | tm & koRpus |
| Psychometrics | lavaan & mirt |
| Spatial | sp & rgdal |
| Time Series | zoo & forecast |
From CRAN:
install.packages(c("devtools", "dplyr"))
From GitHub:
devtools::install_github("hadley/httr")
install.packages("tidyverse")
library(tidyverse)
Note: Not for CRC cluster use
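Reinstalling packages on every run is slow, so one common pattern is to install only what is missing. The helper below is a hypothetical sketch (missingPackages is not part of any package) built on base R's requireNamespace():

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

toInstall <- missingPackages(c("devtools", "dplyr"))

# Only install interactively, and only when something is actually missing
if (interactive() && length(toInstall) > 0) {
  install.packages(toInstall)
}
```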
| Package | Use Case |
|---|---|
| ggplot2 | data visualisation |
| dplyr | data manipulation |
| tidyr | data tidying |
| readr | data import |
| purrr | functional programming |
| tibble | tibbles, a modern re-imagining of data frames |
| Package | Use Case |
|---|---|
| hms | times |
| stringr | strings |
| lubridate | date/times |
| forcats | factors |
| Package | Use Case |
|---|---|
| DBI | databases |
| haven | SPSS, SAS and Stata files |
| httr | web apis |
| jsonlite | JSON |
| readxl | .xls and .xlsx files |
| rvest | web scraping |
| xml2 | XML |
| Package | Use Case |
|---|---|
| modelr | simple modelling within a pipeline |
| broom | turning models into tidy data |
We saw a glimpse of what base R has to offer in terms of data manipulation.
As powerful as the indexing approach may be, it can often be messy and slightly confusing to someone who may be interested in using your code (or the future you).
### NICE R DATA ###
# numeric indexes; not conducive to readability or reproducibility
newData = mtcars[, 1:4]
# explicitly by name; fine if only a handful; not pretty
newData = mtcars[, c('mpg', 'cyl', 'disp', 'hp')]
### MEAN REAL DATA ###
# two step with grep (searching with regular expressions)
cols = c('ID', paste0('X', 1:10), 'var1', 'var2',
         grep("^Merc[0-9]+", colnames(oldData), value = TRUE))
newData = oldData[, cols]
# or via subset
newData = subset(oldData, select = cols)
What if you also want observations where Z is Yes, Q is No, and only the last 50 of those results, ordered by var1 (descending)?
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No', ]
newData = tail(newData, 50)
newData = newData[order(newData$var1, decreasing = TRUE), ]
And this is for fairly straightforward operations.
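Since oldData is not defined in these slides, here is a toy stand-in that makes the filter, tail, and order steps runnable. The data values are invented for illustration, and the example keeps the last 3 matches rather than 50 because the toy data is small.

```r
# Hypothetical stand-in for oldData; the values are made up for this sketch
oldData <- data.frame(
  ID   = 1:6,
  Z    = c("Yes", "Yes", "No", "Yes", "Yes", "Yes"),
  Q    = c("No",  "No",  "No", "Yes", "No",  "No"),
  var1 = c(3, 9, 5, 7, 1, 4)
)

# Step 1: keep rows where Z is Yes and Q is No
newData <- oldData[oldData$Z == "Yes" & oldData$Q == "No", ]

# Step 2: keep only the last 3 of those rows
newData <- tail(newData, 3)

# Step 3: sort by var1, largest first
newData <- newData[order(newData$var1, decreasing = TRUE), ]

print(newData)
```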
The dplyr package was created to make data manipulation easier.
newData = oldData %>%
  filter(Z == 'Yes', Q == 'No') %>%
  select(num_range('X', 1:10), contains('var'), starts_with('Merc')) %>%
  tail(50) %>%
  arrange(desc(var1))
mtcars %>%
  filter(am == 0) %>% # Automatic transmission
  select(mpg, cyl, hp, wt) %>%
  mutate(rawWeight = wt * 1000) %>%
  group_by(cyl) %>%
  summarize_all(mean) # funs() is deprecated in newer dplyr releases
## # A tibble: 3 × 5
##     cyl    mpg        hp       wt rawWeight
##   <dbl>  <dbl>     <dbl>    <dbl>     <dbl>
## 1     4 22.900  84.66667 2.935000  2935.000
## 2     6 19.125 115.25000 3.388750  3388.750
## 3     8 15.050 194.16667 4.104083  4104.083
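In dplyr 1.0.0 and later, across() is the documented replacement for the scoped verbs such as summarize_all(). The sketch below rewrites the mtcars pipeline above in that style; it assumes a recent dplyr is installed, and the name autoSummary is illustrative.

```r
library(dplyr)

# Same pipeline as above, using across() instead of summarize_all()
autoSummary <- mtcars %>%
  filter(am == 0) %>%                      # automatic transmission
  select(mpg, cyl, hp, wt) %>%
  mutate(rawWeight = wt * 1000) %>%
  group_by(cyl) %>%
  summarize(across(everything(), mean))    # mean of every non-grouping column

print(autoSummary)
```

Inside a grouped summarize(), everything() skips the grouping column, so cyl is kept as the group label rather than averaged.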
x = c(1, 2, NA, NA, 5, 6, NA, 8, NA, NA)
y = c(NA, NA, 3, 4, NA, NA, NA, NA, NA, NA)
z = c(NA, NA, NA, NA, NA, NA, 7, NA, 9, 10)
coalesce(x, y, z)
## [1] 1 2 3 4 5 6 7 8 9 10
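coalesce() returns, position by position, the first value that is not missing. To make that concrete, here is a rough base R equivalent; coalesceBase is a hypothetical name, and this is a sketch of the idea rather than how dplyr implements it.

```r
# Hypothetical base R stand-in for dplyr::coalesce():
# walk through the vectors left to right, filling NAs from the next one
coalesceBase <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

x <- c(1, 2, NA, NA, 5, 6, NA, 8, NA, NA)
y <- c(NA, NA, 3, 4, NA, NA, NA, NA, NA, NA)
z <- c(NA, NA, NA, NA, NA, NA, 7, NA, 9, 10)

filled <- coalesceBase(x, y, z)
print(filled)
```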
In the previous snippet, you hopefully noticed the %>%.
It is included in dplyr, but it originates in magrittr.
It is pronounced “pipe” and works much like the Unix |: the result on the left is passed as the first argument to the function on the right.
Old-school R:
ceiling(mean(abs(sample(-100:100, 50))))
Piping:
-100:100 %>%
  sample(50) %>%
  abs %>%
  mean %>%
  ceiling
Both are valid, but one is just a bit easier for human eyes and easier to code.
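To check that the nested and piped forms really compute the same thing, fix the random seed before each one. This sketch uses the native |> pipe that base R gained in version 4.1.0, so it needs no packages; with magrittr loaded, %>% would work the same way here.

```r
# Nested calls, read from the inside out
set.seed(42)
nested <- ceiling(mean(abs(sample(-100:100, 50))))

# The same steps, read top to bottom (native pipe, R >= 4.1.0)
set.seed(42)
piped <- (-100:100) |>
  sample(50) |>
  abs() |>
  mean() |>
  ceiling()

print(identical(nested, piped))
```

Unlike %>%, the native pipe requires the parentheses after each function name.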
We have only really seen the tip of the iceberg with regard to what R has to offer.
Do take some time to look through the CRAN Task Views.
The R-bloggers website always has new and neat stuff.
Daily and weekly trending repositories on GitHub are also enlightening.
##
## If you think you can learn all of R, you are wrong. For the foreseeable
## future you will not even be able to keep up with the new additions.
## -- Patrick Burns (Inferno-ish R)
## CambR User Group Meeting, Cambridge (May 2012)
library(plotly)
plot_ly(economics, x = ~date, y = ~uempmed) %>%
  add_trace(y = ~fitted(loess(uempmed ~ as.numeric(date))), x = ~date) %>%
  layout(title = "Median duration of unemployment (in weeks)", showlegend = FALSE) %>%
  dplyr::filter(uempmed == max(uempmed)) %>%
  layout(annotations = list(x = ~date, y = ~uempmed, text = "Peak", showarrow = TRUE))
library(plotly)
df <- read.csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_us_airport_traffic.csv')
df$pop = maps::us.cities$pop[match(paste(df$city, df$state), maps::us.cities$name)]
df$hover <- with(df, paste(airport, city, '<br>',
                           "Population: ", pop, '<br>',
                           "Arrivals: ", cnt))
# marker styling
m <- list(
  colorbar = list(title = "Incoming flights February 2011"),
  size = scales::rescale(df$pop, c(5, 20)), opacity = 0.5, border = 'rgba(0,0,0,0)'
)
# geo styling
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showland = TRUE,
  landcolor = toRGB("gray95"),
  subunitcolor = toRGB("gray85"),
  countrycolor = toRGB("gray85"),
  countrywidth = 0.5,
  subunitwidth = 0.5,
  bgcolor = 'rgba(0,0,0,0)'
)
plot_ly(df, lat = ~lat, lon = ~long, text = ~hover, color = ~cnt, marker = m,
        type = 'scattergeo', locationmode = 'USA-states', mode = 'markers', colors = 'RdBu',
        width = 1000) %>%
  layout(title = 'Most trafficked US airports<br>(Hover for airport)', geo = g,
         paper_bgcolor = 'rgba(0,0,0,0)',
         plot_bgcolor = 'rgba(0,0,0,0)',
         font = list(color = toRGB("gray85"))
  )
library(ggplot2)
p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(text = paste("Transmission:", as.factor(am))), size = 2) +
  geom_smooth(aes(colour = as.ordered(cyl), fill = as.ordered(cyl)),
              show.legend = FALSE) +
  facet_grid(. ~ cyl) +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  #scale_colour_discrete(name = "Cylinders") +
  lazerhawk::theme_trueMinimal()
ggplotly(p)
DT::datatable(head(mtcars1), filter = "top")
Markdown is a markup language.
Now one can intermingle R with Markdown, HTML, CSS, JavaScript, LaTeX, and others, resulting in a variety of products.
RStudio and R Markdown make it easy to construct:
`r lmSum = summary(lm(mpg ~ wt, data = mtcars))
if (lmSum$coefficients[2, 4] < .05) {
  paste("Weight's coefficient of",
        round(lmSum$coefficients[2], 3),
        "is significant", sep = " ")
} else {
  paste("Weight's coefficient of",
        round(lmSum$coefficients[2], 3),
        "is not significant", sep = " ")
}`
## [1] "Weight's coefficient of -5.344 is significant"
A good man once said:
You, my dear sir, are but a mere bootless beef-witted bugbear and I bid you a good day.
paste(sample(c('artless','bawdy','beslubbering','bootless'), 1),
      sample(c('base-court','bat-fowling','beef-witted','beetle-headed'), 1),
      sample(c('apple-john','baggage','barnacle','bladder','boar-pig'), 1))
## [1] "beslubbering bat-fowling barnacle"
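Because sample() draws at random, the insult changes on every run. For a reproducible document, setting a seed first pins the result. A minimal sketch, with makeInsult as a hypothetical wrapper around the code above and an arbitrary seed value:

```r
# Hypothetical wrapper around the three sample() calls shown above
makeInsult <- function() {
  paste(sample(c('artless', 'bawdy', 'beslubbering', 'bootless'), 1),
        sample(c('base-court', 'bat-fowling', 'beef-witted', 'beetle-headed'), 1),
        sample(c('apple-john', 'baggage', 'barnacle', 'bladder', 'boar-pig'), 1))
}

# The same seed gives the same draw; 2024 is an arbitrary choice
set.seed(2024)
first <- makeInsult()
set.seed(2024)
second <- makeInsult()

print(identical(first, second))
```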
RStudio wants everything to be easy for us as R users.
They provide a series of cheat sheets as reference material.
Data Visualization
Data Wrangling
R Markdown
Package Development
Shiny
Seth Berry @ Mendoza College of Business
Michael Clark @ CSCAR, U of Mich
Anshumaan Bajpai @ Center for Social Research
Center for Digital Scholarship